Abstract

The multiarmed bandit problem is an example of the dilemma between exploration
and exploitation in reinforcement learning. It is modeled as a gambler playing
a slot machine with multiple arms. A policy chooses an arm so
as to minimize the number of times that arms with inferior expectations are pulled.
We propose the minimum empirical divergence (MED) policy and prove its asymptotic
optimality for the case of finite support models. In a setting similar
to ours, Burnetas and Katehakis have already proposed an asymptotically optimal
policy. However, we do not assume knowledge of the specific support except for the
upper and lower bounds of the support. Furthermore, the criterion for choosing an
arm, minimum empirical divergence, can be computed easily by a convex optimization
technique. We confirm by simulations that the MED policy demonstrates good
performance in finite time in comparison to other currently popular policies.

1 Introduction 

The multiarmed bandit problem is a problem based on an analogy with playing a slot 
machine with more than one arm or lever. Each arm has a reward distribution and the 
objective of a gambler is to maximize the collected sum of rewards by choosing an arm 
to pull for each round. There is a dilemma between exploration and exploitation, namely 
the gambler can not tell whether an arm is optimal unless he pulls it many times, but it 
is also a loss to pull an inferior (i.e. non-optimal) arm many times. 

We consider an infinite-horizon $K$-armed bandit problem. There are $K$ arms $\Pi_1,
\dots, \Pi_K$ and arms are pulled an infinite number of times. $\Pi_j$ has a probability distribution $F_j$
with the expected value $\mu_j$ and the player receives a reward according to $F_j$ independently
in each round. If the expected values are known, it is optimal to always pull the arm with
the maximum expected value $\mu^* = \max_j \mu_j$. A policy is an algorithm to choose the next
arm to pull based on the results of past rounds.

This problem was first considered by Robbins [16]. Since then, many studies have been
conducted for the problem [21 El [151 UHl [13 [21] . There are also many extensions for the 
problem. For example, Auer et al. [1] removed the assumption that rewards are stochastic, 
and for the stochastic setting, the case of non-stationary distributions [101 [HI 112] , or the 
case of infinite (possibly uncountable) arms [U [T3] have been considered. 

In our setting, Lai and Robbins [T?] established a theoretical framework for determining
optimal policies, and Burnetas and Katehakis [6] extended their result to multiparameter
or non-parametric models. Consider a model $\mathcal{F}$, a generic family of distributions.
The player knows $\mathcal{F}$ and that $F_j$ is an element of $\mathcal{F}$. Let $T_j(n)$ denote the number of
times that $\Pi_j$ has been pulled over the first $n$ rounds. A policy is consistent on model $\mathcal{F}$
if $\mathrm{E}[T_j(n)] = o(n^a)$ for all inferior arms $\Pi_j$ and all $a > 0$.

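Burnetas and Katehakis [6] proved the following lower bound for any inferior arm $\Pi_j$
under a consistent policy:

\[
T_j(n) \ge \left( \frac{1}{\inf_{G \in \mathcal{F} : \mathrm{E}(G) > \mu^*} D(F_j \| G)} + o(1) \right) \log n \tag{1}
\]

with probability tending to one, where $\mathrm{E}(G)$ is the expected value of distribution $G$ and
$D(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence. Under mild regularity conditions on $\mathcal{F}$,

\[
\inf_{G \in \mathcal{F} : \mathrm{E}(G) > \mu} D(F \| G) = \min_{G \in \mathcal{F} : \mathrm{E}(G) \ge \mu} D(F \| G)
\]

and we write

\[
D_{\min}(F, \mu) = \min_{G \in \mathcal{F} : \mathrm{E}(G) \ge \mu} D(F \| G)
\]

in the following.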

A policy is asymptotically optimal if the expected value of $T_j(n)$ achieves the right-hand
side of (1) as $n \to \infty$. Policies achieving the above bound are also proposed
in [T?] and [6]. These policies are based on the notion of upper confidence bound, which can
be interpreted as the upper confidence limit for the expectation of each arm with the
significance level $1/n$.

Although policies based on the upper confidence bound are optimal, upper confidence
bounds are often hard to compute in practice. For this reason, Auer et al. [3] proposed
a family of policies called UCB. UCB policies estimate the expectation of each arm in a way
similar to the upper confidence bound. They are practical because of their simple form and good
performance. In particular, "UCB-tuned" is widely used because of its excellent simulation
results. However, UCB-tuned has not been analyzed theoretically and it is unknown
whether the policy is consistent. Theoretical analyses of other UCB policies have been
given, but their coefficients of the logarithmic term do not necessarily achieve the bound (1).

In this paper we propose the minimum empirical divergence (MED) policy. We prove the
asymptotic optimality of MED when the model $\mathcal{F}$ is the family of distributions with a
finite bounded support, denoted by $\mathcal{A}$. This model consists of all distributions with finite
supports over a given interval, e.g. $[-1, 0]$. It is larger than the model used in [6], which
assumes a specific finite support. We also present simulation results for the MED policy
in comparison with UCB policies.

Our MED policy is motivated by the observation of (1). When a policy achieving (1)
is used, an inferior arm $\Pi_j$ waits roughly $\exp(n_j D_{\min}(F_j, \mu^*))$ rounds to be pulled after
the $n_j$-th play of $\Pi_j$. Then, it can be expected that a policy pulling $\Pi_j$ with probability
$\exp(-n_j D_{\min}(F_j, \mu^*))$ will achieve (1). The MED policy is obtained by plugging $\hat F_j, \hat\mu^*$ into
$F_j, \mu^*$ in $D_{\min}$, where $\hat F_j$ is the empirical distribution of rewards from $\Pi_j$ and $\hat\mu^*$ is the
current best sample mean.

The MED policy requires a computation of
$D_{\min}(\hat F_j, \hat\mu^*) = \min_{G \in \mathcal{A} : \mathrm{E}(G) \ge \hat\mu^*} D(\hat F_j \| G)$ at each
round whereas the upper confidence bound requires the computation of

\[
\max_{G \in \mathcal{A} : D(\hat F_i \| G) \le \frac{\log n}{T_i(n)}} \mathrm{E}(G). \tag{2}
\]

$D_{\min}$ and (2) are quantities dual to each other but the former has two advantages in practical
implementation. First, $D_{\min}(\hat F_i, \hat\mu^*)$ is smooth in $\hat\mu^*$, which converges to $\mu^*$. Therefore the
value in the previous round can be used as a good approximation of $D_{\min}$ for the current
round. On the other hand, (2) continues to increase with $n$ and it has to be
computed many times. Second, as shown in Theorem 5 below, $D_{\min}$ can be expressed as
a univariate convex optimization problem for our model $\mathcal{A}$. Although (2) is also a convex
optimization problem, the nonlinear constraint $D(\hat F_i \| G) \le \frac{\log n}{T_i(n)}$ is harder to handle.

The MED policy is categorized as a probability matching method (see, e.g., [19] for a
classification of policies). In this method each arm is pulled according to the probability
reflecting how likely the arm is to be optimal. For example, Wyatt [20] proposed
probability matching policies for Boolean and Gaussian models by a Bayesian approach with
prior/posterior distributions. In our approach the probability assigned to each arm is
determined by (normalized) maximum likelihood instead of posterior probability.

This paper is organized as follows. In Section 2 we give definitions used throughout
this paper and show the asymptotic bound by [6], which is satisfied by any consistent
policy. In Section 3 we propose the MED policy and prove that it is asymptotically optimal
for finite support models. We also discuss practical implementation issues of the minimization
problem involved in MED. In Section 4 some simulation results are shown. We conclude
the paper with some remarks in Section 5.

2 Preliminaries 

In this section we introduce the notation of this paper and present the asymptotic bound for
a generic model, which is established by [6].

Let $\mathcal{F}$ be a generic family of probability distributions on $\mathbb{R}$ and let $F_j \in \mathcal{F}$ be the
distribution of $\Pi_j$, $j = 1, \dots, K$. $P_F[\cdot]$ and $\mathrm{E}_F[\cdot]$ denote the probability and the
expectation under $F \in \mathcal{F}$, respectively. When we write e.g. $P_F[X \in A]$ ($A \subset \mathbb{R}$) or
$\mathrm{E}_F[\theta(X)]$ ($\theta(\cdot)$ is a function $\mathbb{R} \to \mathbb{R}$), $X$ denotes a random variable with distribution $F$.
We define $F(A) = P_F[X \in A]$ and $\mathrm{E}(F) = \mathrm{E}_F[X]$.

A set of probability distributions for $K$ arms is denoted by
$\boldsymbol{F} = (F_1, \dots, F_K) \in \mathcal{F}^K = \prod_{j=1}^{K} \mathcal{F}$.
The joint probability and the expected value under $\boldsymbol{F}$ are denoted by $P_{\boldsymbol{F}}[\cdot]$ and
$\mathrm{E}_{\boldsymbol{F}}[\cdot]$, respectively.

The expected value of $\Pi_j$ is denoted by $\mu_j = \mathrm{E}(F_j)$. We denote the optimal expected
value by $\mu^* = \max_j \mu_j$. Let $J_n$ be the arm chosen in the $n$-th round. Then

\[
T_j(n) = \sum_{m=1}^{n} I[J_m = j],
\]

where $I[\cdot]$ denotes the indicator function. For notational convenience we write $T_j'(n) =
T_j(n - 1)$, which is the number of times the arm $\Pi_j$ has been pulled prior to the $n$-th
round.

Let $\hat F_{j,t}$ and $\hat\mu_{j,t} = \mathrm{E}(\hat F_{j,t})$ be the empirical distribution and the mean of the first
$t$ rewards from $\Pi_j$, respectively. Similarly, let $\hat F_j(n) = \hat F_{j,T_j'(n)}$ and $\hat\mu_j(n) = \hat\mu_{j,T_j'(n)}$
be the empirical distribution and mean of $\Pi_j$ after the first $n - 1$ rounds, respectively.
$\hat\mu^*(n) = \max_j \hat\mu_j(n)$ denotes the highest empirical mean after $n - 1$ rounds. We call $\Pi_j$ a
current best if $\hat\mu_j(n) = \hat\mu^*(n)$.

Let $\Omega$ denote the whole sample space. For an event $A \subset \Omega$, the complement of $A$
is denoted by $A^c$. The joint probability of two events $A$ and $B$ under $\boldsymbol{F}$ is written as
$P_{\boldsymbol{F}}[A \cap B]$. For notational simplicity we often write, e.g., $P_{\boldsymbol{F}}[J_n = j \cap T_j(n) = t]$ instead
of the more precise $P_{\boldsymbol{F}}[\{J_n = j\} \cap \{T_j(n) = t\}]$.

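Finally we define an index for $F \in \mathcal{F}$ and $\mu \in \mathbb{R}$

\[
D_{\inf}(F, \mu, \mathcal{F}) = \inf_{G \in \mathcal{F} : \mathrm{E}(G) > \mu} D(F \| G)
\]

where the Kullback-Leibler divergence $D(F \| G)$ is given by

\[
D(F \| G) =
\begin{cases}
\mathrm{E}_F\!\left[\log \frac{dF}{dG}\right] & \frac{dF}{dG} \text{ exists}, \\
+\infty & \text{otherwise}.
\end{cases}
\]

$D_{\inf}$ represents how distinguishable $F$ is from distributions having expectations larger
than $\mu$. If $\{G \in \mathcal{F} : \mathrm{E}(G) > \mu\}$ is empty, we define $D_{\inf}(F, \mu, \mathcal{F}) = +\infty$.

We adopt the Lévy distance $L(F, G)$ for the distance between two distributions $F, G$. We use only the fact that
the convergence of the Lévy distance $L(F, F_n) \to 0$ is equivalent to the weak convergence
of $\{F_n\}$ to the distribution $F$ and we write $F_n \to F$ in this sense.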
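For the finite support distributions considered later, the Kullback-Leibler divergence reduces to a finite sum. The following is a minimal sketch of that special case, assuming both distributions are given as probability lists aligned on the same atoms; the function name `kl_divergence` is illustrative and not from the paper.

```python
import math


def kl_divergence(f, g):
    """D(F||G) for two distributions on the same finite set of atoms.

    f and g are sequences of probabilities aligned by atom (an assumption of
    this sketch). Returns +inf when F puts mass on an atom where G has none,
    i.e. when dF/dG does not exist.
    """
    total = 0.0
    for fi, gi in zip(f, g):
        if fi == 0.0:
            continue              # convention: 0 log 0 = 0
        if gi == 0.0:
            return math.inf       # F is not absolutely continuous w.r.t. G
        total += fi * math.log(fi / gi)
    return total
```

Note the asymmetry: `kl_divergence([1.0, 0.0], [0.5, 0.5])` is finite, while swapping the arguments gives `+inf`.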

Lai and Robbins [H] gave a lower bound for $\mathrm{E}[T_j(n)]$ for any inferior $\Pi_j$ when a
consistent policy is adopted. However, their result was hard to apply to multiparameter
models and more general non-parametric models. Later Burnetas and Katehakis [6]
extended the bound to general non-parametric models. Their bound is given as follows.

Theorem 1. [6, Proposition 1] Fix a consistent policy and $\boldsymbol{F} \in \mathcal{F}^K$. If $\mathrm{E}(F_i) < \mu^*$ and
$0 < D_{\inf}(F_i, \mu^*, \mathcal{F}) < \infty$, then for any $\epsilon > 0$

\[
\lim_{N \to \infty} P_{\boldsymbol{F}}\!\left[ T_i(N) \ge \frac{(1 - \epsilon) \log N}{D_{\inf}(F_i, \mu^*, \mathcal{F})} \right] = 1.
\]

Consequently

\[
\liminf_{N \to \infty} \frac{\mathrm{E}_{\boldsymbol{F}}[T_i(N)]}{\log N} \ge \frac{1}{D_{\inf}(F_i, \mu^*, \mathcal{F})}. \tag{3}
\]


3 Asymptotically Optimal Policy for Finite Support 
Models 

Let $\mathcal{A} = \{F : |\mathrm{supp}(F)| < \infty,\ \mathrm{supp}(F) \subset [a, b]\}$ be the family of distributions with
a finite bounded support, where $\mathrm{supp}(F)$ is the support of distribution $F$ and $a, b$ are
constants known to the player. We assume $a = -1$, $b = 0$ without loss of generality. We
write $\mathrm{supp}'(F) = \{0\} \cup \mathrm{supp}(F)$ and $\mathcal{A}_X = \{G \in \mathcal{A} : \mathrm{supp}(G) \subset X\}$ where $X$ is an
arbitrary subset of $[-1, 0]$.

We consider $\mathcal{A}$ as a model $\mathcal{F}$ and propose a policy which we call the minimum empirical
divergence (MED) policy in this section. We prove in Theorem 3 that the proposed policy
achieves the bound given in the previous section. Then, we describe a univariate convex
optimization technique to compute $D_{\min}$ used in the policy.

Note that the finiteness of the support can not be determined from finite samples and 
every policy for A is applicable also for {F : supp(F) C [a, b]}. However our proof of the 
optimality in this paper is for the above A. The advantage of assuming the finiteness is 
that we can employ the method of types in the large deviation technique. This enables 
us to consider all empirical distributions obtained from each arm. 

In this model it is convenient to use

\[
D_{\min}(F, \mu, \mathcal{A}) = \min_{G \in \mathcal{A} : \mathrm{E}(G) \ge \mu} D(F \| G)
\]

instead of $D_{\inf}(F, \mu, \mathcal{A}) = \inf_{G \in \mathcal{A} : \mathrm{E}(G) > \mu} D(F \| G)$. Properties of the minimizer $G^*$ of the
right-hand side will be discussed in Section 3.2.

Lemma 2. $D_{\min}(F, \mu, \mathcal{A}) = D_{\inf}(F, \mu, \mathcal{A})$ holds for all $F \in \mathcal{A}$ and $\mu < 0$.

Proof. We will prove in Lemma 6 that $D_{\min}(F, \mu, \mathcal{A})$ is continuous in $\mu < 0$.
$D_{\min}(F, \mu, \mathcal{A}) = D_{\inf}(F, \mu, \mathcal{A})$ follows easily from the continuity. □



3.1 Optimality of the Minimum Empirical Divergence Policy 

We now introduce our MED policy. In MED an arm is chosen randomly in the following
way:

[Minimum Empirical Divergence Policy]
Initialization. Pull each arm once.
Loop. For the $n$-th round,

1. For each $j$ compute $\hat D_j(n) = D_{\min}(\hat F_j(n), \hat\mu^*(n), \mathcal{A})$.

2. Choose arm $\Pi_j$ according to the probability

\[
p_j(n) = \frac{\exp(-T_j'(n) \hat D_j(n))}{\sum_{i=1}^{K} \exp(-T_i'(n) \hat D_i(n))}.
\]

Note that

\[
\frac{1}{K} \le p_j(n) \le 1 \tag{4}
\]

for any currently best $\Pi_j$ since $\hat D_j(n) = 0$. As a result, it holds for all $j$ that

\[
\frac{1}{K} \exp(-T_j'(n) \hat D_j(n)) \le p_j(n) \le \exp(-T_j'(n) \hat D_j(n)). \tag{5}
\]

Intuitively, $p_j(n)$ for a currently not best arm $\Pi_j$ corresponds to the maximum likelihood
that $\Pi_j$ is actually the best arm. Therefore in MED an arm $\Pi_j$ is pulled with the
probability proportional to this likelihood.

Note that our policy is a randomized policy. Therefore probability statements below on 
MED also involve this randomization. However for notational simplicity we omit denoting 
this randomization. 
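The arm-selection step of the policy can be sketched as follows, assuming the pull counts $T_j'(n)$ and the divergences $\hat D_j(n)$ have already been computed. The function name `med_probabilities` is illustrative, and subtracting the smallest exponent before exponentiating is a numerical-stability choice of this sketch, not part of the policy as stated.

```python
import math


def med_probabilities(pulls, divergences):
    """Sampling probabilities p_j(n) of the MED policy.

    pulls[j]       -- T'_j(n), number of previous pulls of arm j (>= 1)
    divergences[j] -- D_j(n), the empirical divergence of arm j (0 for a
                      current best arm, possibly +inf)
    """
    exponents = [t * d for t, d in zip(pulls, divergences)]
    shift = min(exponents)            # 0 whenever a current best arm is present
    weights = [math.exp(-(e - shift)) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `med_probabilities([10, 5, 20], [0.0, 0.2, 0.1])` gives weights proportional to $1, e^{-1}, e^{-2}$, so the current best arm receives the largest probability and bounds (4) and (5) can be checked directly.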

Now we present the main theorem of this paper. 

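Theorem 3. Fix $\boldsymbol{F} \in \mathcal{A}^K$ satisfying $\mu_j = \mu^*$ and $\mu_i < \mu^*$ for all $i \ne j$. Under the MED
policy, for any $i \ne j$ and $\epsilon > 0$ it holds that

\[
\mathrm{E}_{\boldsymbol{F}}[T_i(N)] \le \frac{1 + \epsilon}{D_{\min}(F_i, \mu^*, \mathcal{A})} \log N + O(1).
\]

Note that we obtain

\[
\limsup_{N \to \infty} \frac{\mathrm{E}_{\boldsymbol{F}}[T_i(N)]}{\log N} \le \frac{1}{D_{\min}(F_i, \mu^*, \mathcal{A})}
\]

by dividing both sides by $\log N$, letting $N \to \infty$ and finally letting $\epsilon \downarrow 0$. In view of (3) we
see that the MED policy is asymptotically optimal. We give a proof of Theorem 3 in Section 3.3.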

The following corollary shows that the optimality of the MED policy given in Theorem 3
is a generalization of the optimality in [6].

Corollary 1. Let $X \subset [-1, 0]$ be an arbitrary subset of $[-1, 0]$ such that $0 \in X$. Fix
$\boldsymbol{F} \in \mathcal{A}_X^K$ satisfying $\mu_j = \mu^*$ and $\mu_i < \mu^*$ for all $i \ne j$. Under the MED policy, for any $i \ne j$
and $\epsilon > 0$ it holds that

\[
\mathrm{E}_{\boldsymbol{F}}[T_i(N)] \le \frac{1 + \epsilon}{D_{\min}(F_i, \mu^*, \mathcal{A}_X)} \log N + O(1). \tag{6}
\]

Proof. We prove in Lemma 4 that $D_{\min}(F, \mu, \mathcal{A}) = D_{\min}(F, \mu, \mathcal{A}_{\mathrm{supp}'(F)})$. On the other
hand, $D_{\min}(F, \mu, \mathcal{A}_{\mathrm{supp}'(F)}) \ge D_{\min}(F, \mu, \mathcal{A}_X)$ holds from $\mathcal{A}_{\mathrm{supp}'(F)} \subset \mathcal{A}_X$. Then we obtain
(6) from Theorem 3. □

Note that (6) is achieved also by the policy used in [6] if $X$ is fixed and assumed
to be known. Our result establishes the same bound without this assumption.

3.2 Computation of $D_{\min}$ and Properties of the Minimizer

For implementing the MED policy it is essential to efficiently compute the minimum empirical
divergence $D_{\min}(\hat F_j(n), \hat\mu^*(n), \mathcal{A})$ for each round. In this subsection, we clarify the nature
of the convex optimization involved in $D_{\min}(\hat F_j(n), \hat\mu^*(n), \mathcal{A})$ and show how the minimization
can be computed efficiently. In addition, for the proofs of Lemma 2 and Theorem 3 we
need to clarify the behavior of $D_{\min}(F, \mu, \mathcal{A})$ as a function of $\mu$.

First we prove that it is sufficient to consider $\mathcal{A}_{\mathrm{supp}'(F)}$ for the computation of
$D_{\min}(F, \mu, \mathcal{A})$:

Lemma 4. $D_{\min}(F, \mu, \mathcal{A}) = D_{\min}(F, \mu, \mathcal{A}_{\mathrm{supp}'(F)})$ holds for any $F \in \mathcal{A}$.

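Proof. Take an arbitrary $G \in \mathcal{A} \setminus \mathcal{A}_{\mathrm{supp}'(F)}$ such that $\mathrm{E}(G) \ge \mu$ and $G(\mathrm{supp}'(F)) = p < 1$.
Define $G' \in \mathcal{A}_{\mathrm{supp}'(F)}$ as

\[
G'(\{x\}) =
\begin{cases}
G(\{0\}) + (1 - p) & x = 0, \\
G(\{x\}) & x \ne 0,\ x \in \mathrm{supp}(F), \\
0 & \text{otherwise}.
\end{cases}
\]

Since $D(F \| G') \le D(F \| G)$ and $\mathrm{E}(G') \ge \mathrm{E}(G)$, we obtain

\[
\min_{G \in \mathcal{A} : \mathrm{E}(G) \ge \mu} D(F \| G) \ge \min_{G' \in \mathcal{A}_{\mathrm{supp}'(F)} : \mathrm{E}(G') \ge \mu} D(F \| G').
\]

The converse inequality is obvious from $\mathcal{A}_{\mathrm{supp}'(F)} \subset \mathcal{A}$. □

In view of this lemma, we simply write $D_{\min}(F, \mu)$ instead of $D_{\min}(F, \mu, \mathcal{A}) = D_{\min}(F,
\mu, \mathcal{A}_{\mathrm{supp}'(F)})$ when the third argument is obvious from the context.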

Let $M = |\mathrm{supp}'(F)|$ and denote the finite symbols in $\mathrm{supp}'(F)$ by $x_1, \dots, x_M$, i.e.
$\{0\} \cup \mathrm{supp}(F) = \{x_1, \dots, x_M\}$. We assume $x_1 = 0$ and $x_i < 0$ for $i > 1$ without loss of
generality and write $f_i = F(\{x_i\})$.

Now the computation of $D_{\min}(F, \mu)$ is formulated as the following convex optimization
problem for $G = (g_1, \dots, g_M)$ from Lemma 4:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{M} f_i \log \frac{f_i}{g_i} \\
\text{subject to} \quad & -g_i \le 0,\ \forall i, \qquad \mu - \sum_{i=1}^{M} x_i g_i \le 0, \qquad \sum_{i=1}^{M} g_i = 1,
\end{aligned}
\tag{7}
\]

where we define $0 \log 0 = 0$, $0 \log \frac{0}{0} = 0$, and $\frac{1}{0} = +\infty$.

It is obvious that $G = F$ is the optimal solution with the optimal value $0$ when
$0 \ge \mathrm{E}(F) \ge \mu$. Also $G = \delta_0$, the unit point mass at $0$, is the unique feasible solution if
$\mu = 0$. For $\mu > 0$ the problem is infeasible. Since these cases are trivial, we consider the
case $\mathrm{E}(F) < \mu < 0$ in the following.

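Define $h(\nu)$ and its first and second order derivatives as

\[
h(\nu) = \mathrm{E}_F[\log(1 - (X - \mu)\nu)] = \sum_{i=1}^{M} f_i \log(1 - (x_i - \mu)\nu), \tag{8}
\]

\[
h'(\nu) = \frac{\partial}{\partial \nu} h(\nu) = -\sum_{i=1}^{M} \frac{f_i (x_i - \mu)}{1 - (x_i - \mu)\nu}, \tag{9}
\]

\[
h''(\nu) = \frac{\partial^2}{\partial \nu^2} h(\nu) = -\sum_{i=1}^{M} \frac{f_i (x_i - \mu)^2}{(1 - (x_i - \mu)\nu)^2}. \tag{10}
\]

Now we show in Theorem 5 that the computation of $D_{\min}$ is expressed as maximization
of $h(\nu)$. Since $h(\nu)$ is concave, it is a univariate convex optimization problem. Therefore
$D_{\min}$ can be computed easily by iterative methods such as Newton's method (see, e.g., [5]
for general methods of convex programming).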

Theorem 5. Define $\mathrm{E}_F[\mu/X] = \infty$ for the case $F(\{0\}) = f_1 > 0$. Then the following three
properties hold for $\mathrm{E}(F) < \mu < 0$:
(i) $D_{\min}(F, \mu)$ is written as

\[
D_{\min}(F, \mu) = \max_{0 \le \nu \le -\frac{1}{\mu}} h(\nu) \tag{11}
\]

and the optimal solution $\nu^* = \operatorname{argmax}_{0 \le \nu \le -\frac{1}{\mu}} h(\nu)$ is unique.

In particular, for the case $\mathrm{E}_F[\mu/X] \le 1$, $\nu^* = -1/\mu$ and (11) is simply written as

\[
D_{\min}(F, \mu) = h\!\left(-\frac{1}{\mu}\right) = \sum_{i=2}^{M} f_i \log(x_i/\mu). \tag{12}
\]

On the other hand, for the case $\mathrm{E}_F[\mu/X] > 1$, (11) is written as an unconstrained
optimization problem

\[
D_{\min}(F, \mu) = \max_{\nu} h(\nu). \tag{13}
\]

(ii) $\nu^*$ satisfies

\[
\nu^* \ge \frac{\mu - \mathrm{E}(F)}{-\mu(1 + \mu)}.
\]

(iii) $D_{\min}(F, \mu)$ is differentiable in $\mu \in (\mathrm{E}(F), 0)$ and

\[
\frac{\partial}{\partial \mu} D_{\min}(F, \mu) = \nu^*.
\]

We give a proof of Theorem 5 in Section 3.3.
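As a small worked instance of Theorem 5, consider the hypothetical two-point distribution $F(\{0\}) = F(\{-1\}) = 1/2$ with $\mu = -1/4$ (values chosen here only for illustration, so that $\mathrm{E}(F) = -1/2 < \mu < 0$ and, since $f_1 > 0$, the case $\mathrm{E}_F[\mu/X] > 1$ applies):

```latex
\begin{align*}
h(\nu)  &= \tfrac12 \log\!\left(1 - \tfrac{\nu}{4}\right)
         + \tfrac12 \log\!\left(1 + \tfrac{3\nu}{4}\right), \\
h'(\nu) &= -\frac{1/8}{1 - \nu/4} + \frac{3/8}{1 + 3\nu/4} = 0
         \;\Longrightarrow\; \nu^* = \tfrac43, \\
D_{\min}(F, \mu) &= h\!\left(\tfrac43\right)
         = \tfrac12 \log\tfrac23 + \tfrac12 \log 2
         = \tfrac12 \log\tfrac43 \approx 0.1438 .
\end{align*}
```

The same value is obtained directly from problem (7): the minimizer puts mass $3/4$ on $0$ and $1/4$ on $-1$, giving $\tfrac12 \log\frac{1/2}{3/4} + \tfrac12 \log\frac{1/2}{1/4}$. In this example the lower bound in (ii) equals $(1/4)/(3/16) = 4/3$, so it is attained with equality.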

3.3 Proofs of Theorems 3 and 5

In this section we give proofs of Theorems 3 and 5. Actually we prove Theorem 3 using
Theorem 5 and prove Theorem 5 independently of Theorem 3.

We first show Lemmas 6 and 7 on properties of $D_{\min}$ to prove Theorem 3.

Lemma 6. $D_{\min}(F, \mu)$ is monotonically increasing in $\mu$ and possesses the following
continuities: (1) lower semicontinuous in $F \in \mathcal{A}$, that is, $\liminf_{F' \to F} D_{\min}(F', \mu) \ge D_{\min}(F, \mu)$.
(2) continuous in $\mu < 0$.

Note that the continuity in $\mu < 0$ is not trivial at $\mu = \mathrm{E}(F)$ because the differentiability
in Theorem 5 is valid only for the case $\mathrm{E}(F) < \mu < 0$ and $D_{\min}(F, \mu)$ may not be
differentiable at $\mu = \mathrm{E}(F)$.

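Proof. The monotonicity is obvious from the definition of $D_{\min}$.

(1) Fix an arbitrary $\epsilon > 0$. From (11) and the continuity of $h(\nu)$, there exists $\nu_0 \in
[0, -1/\mu)$ such that $\mathrm{E}_F[\log(1 - (X - \mu)\nu_0)] \ge D_{\min}(F, \mu) - \epsilon$. Then we obtain

\[
\begin{aligned}
\liminf_{F' \to F} D_{\min}(F', \mu) &\ge \liminf_{F' \to F} \mathrm{E}_{F'}[\log(1 - (X - \mu)\nu_0)] \\
&= \mathrm{E}_F[\log(1 - (X - \mu)\nu_0)] \qquad (14) \\
&\ge D_{\min}(F, \mu) - \epsilon.
\end{aligned}
\]

Note that $\log(1 - (x - \mu)\nu_0)$ is continuous and bounded in $x \in [-1, 0]$ and (14) follows from
the definition of weak convergence. The lower semicontinuity holds since $\epsilon$ is arbitrary.

(2) The continuity is obvious for $\mu > \mathrm{E}(F)$ from the differentiability in Theorem 5.
The case $\mu < \mathrm{E}(F)$ is also obvious since $D_{\min}(F, \mu) = 0$ holds for $\mu \le \mathrm{E}(F)$. Then it is
sufficient to show

\[
\lim_{\mu \downarrow \mathrm{E}(F)} D_{\min}(F, \mu) = D_{\min}(F, \mathrm{E}(F)) = 0. \tag{15}
\]

From (11) and the concavity of $h(\nu)$, it holds that

\[
h(0) \le D_{\min}(F, \mu) \le h(0) + h'(0) \cdot \frac{1}{-\mu}
\quad\Longleftrightarrow\quad
0 \le D_{\min}(F, \mu) \le \frac{\mathrm{E}(F)}{\mu} - 1
\]

for $\mu > \mathrm{E}(F)$. (15) is obtained by letting $\mu \downarrow \mathrm{E}(F)$. □

Lemma 7. Fix arbitrary $\mu, \mu' \in (-1, 0)$ satisfying $\mu' < \mu$. Then there exists $C(\mu, \mu') > 0$
such that

\[
D_{\min}(F, \mu) - D_{\min}(F, \mu') \ge C(\mu, \mu')
\]

for all $F \in \mathcal{A}$ satisfying $\mathrm{E}(F) \le \mu'$.

Proof. Since $D_{\min}(F, \mu)$ is differentiable in $\mu > \mathrm{E}(F)$ from Theorem 5, we have

\[
\begin{aligned}
D_{\min}(F, \mu) - D_{\min}(F, \mu')
&= \int_{\mu'}^{\mu} \frac{\partial}{\partial u} D_{\min}(F, u)\, du \\
&\ge \int_{\mu'}^{\mu} \frac{u - \mathrm{E}(F)}{-u(1 + u)}\, du \\
&\ge \int_{\mu'}^{\mu} \frac{u - \mu'}{-\mu'(1 + \mu)}\, du \\
&= \frac{(\mu - \mu')^2}{-2\mu'(1 + \mu)} \quad (=: C(\mu, \mu')).
\end{aligned}
\]

□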



Proof of Theorem 3. We define more notation used in the following proof. We fix $j = 1$
and let $L = \{2, \dots, K\}$. Then, $\mu^* = \mu_1$ and $\mu_k < \mu_1$ for $k \in L$. For notational convenience
we denote $J_n(i) = \{J_n = i\}$, which is the event that the arm $\Pi_i$ is pulled at the $n$-th round.

We simply write $\mathrm{E}[\cdot], P[\cdot]$ for the expectation and the probability under $\boldsymbol{F}$ and the
randomization in the policy. Now we define events $A_n, B_n, C_n, D_n$ as follows:

\[
\begin{aligned}
A_n &= \left\{ \hat D_i(n) \ge \frac{D_{\min}(F_i, \mu^*)}{1 + \epsilon/2} \right\}, \\
B_n &= \{ \hat\mu_1(n) \ge \mu_1 - \delta \}, \\
C_n &= \{ \hat\mu_1(n) < \mu_1 - \delta \ \cap\ \max_{k \in L} \hat\mu_k(n) < \mu_1 - \delta \}, \\
D_n &= \{ \hat\mu_1(n) < \mu_1 - \delta \ \cap\ \max_{k \in L} \hat\mu_k(n) \ge \mu_1 - \delta \},
\end{aligned}
\]

where $\delta > 0$ is a constant satisfying $\max_{k \in L} \mu_k < \mu_1 - \delta$ which is set sufficiently small
in the evaluation on $B_n$. Note that $B_n \cup C_n \cup D_n = \Omega$ and each $I[J_n(i)]$ in the sum
$T_i(N) = \sum_{n=1}^{N} I[J_n(i)]$ is bounded from above by

\[
I[J_n(i)] \le I[J_n(i) \cap A_n] + I[J_n(i) \cap C_n] + I[J_n(i) \cap A_n^c \cap B_n] + I[J_n(i) \cap D_n]. \tag{16}
\]

In the following Lemmas 8-11 we bound the expected values of sums of the four terms
on the right-hand side of (16) in this order and they are sufficient to prove Theorem 3. □



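Lemma 8. Fix an arbitrary $\epsilon > 0$. Then it holds that

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap A_n] \right] \le \frac{1 + \epsilon}{D_{\min}(F_i, \mu^*)} \log N + O(1).
\]

Lemma 9.

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap C_n] \right] = O(1).
\]

Lemma 10.

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap A_n^c \cap B_n] \right] = O(1).
\]

Lemma 11.

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap D_n] \right] = O(1).
\]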

Before proving these lemmas, we give intuitive interpretations for these terms.

$A_n$ represents the event that the estimator $\hat D_i(n) = D_{\min}(\hat F_i(n), \hat\mu^*(n))$ of $D_{\min}(F_i,
\mu^*)$ is already close to $D_{\min}(F_i, \mu^*)$ and $\Pi_i$ is pulled with a small probability. After
sufficiently many rounds $A_n$ holds with probability close to 1 and the term $\sum_{n=1}^{N} I[J_n(i) \cap
A_n]$ is the main term of $T_i(N)$.

Other terms of (16) represent events that $\Pi_i$ is pulled when each estimator is not yet
close to the true value. The term involving $C_n$ is essential for the consistency of MED.

$A_n^c \cap B_n$ represents the following event: $\hat D_i(n)$ has not converged because $\hat F_i(n)$ is not
close to $F_i$ although $\hat\mu^*(n)$ is already close to $\mu_1$. In this event $\Pi_i$ is pulled and therefore
$\hat F_i(n)$ is updated more frequently. As a result, $A_n^c \cap B_n$ happens only for a few $n$.

Similarly, $D_n$ represents the event that $\hat\mu_k$ happens to be large for some $k \in L$. Also
in this event $\hat F_k(n)$ is updated more frequently and $D_n$ happens only for a few $n$.

On the other hand, $C_n$ represents the event that $\hat\mu_1$ is not yet close to $\mu_1$. It requires
many rounds for $\Pi_1$ to be pulled since $\Pi_1$ seems to be inferior in this event. Therefore $C_n$
may happen for many $n$.

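Proof of Lemma 8. By partitioning $I[J_n(i) \cap A_n]$ according to the number of occurrences
$\sum_{m=1}^{n-1} I[J_m(i) \cap A_m]$ of the event $J_m(i) \cap A_m$ before the $n$-th round, we have

\[
\sum_{n=1}^{N} I[J_n(i) \cap A_n]
\le \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} + 1
+ \sum_{n=1}^{N} I\!\left[ J_n(i) \cap A_n \cap \left\{ \sum_{m=1}^{n-1} I[J_m(i) \cap A_m] \ge \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} \right\} \right].
\]

Since $\sum_{m=1}^{n-1} I[J_m(i) \cap A_m] \le \sum_{m=1}^{n-1} I[J_m(i)] = T_i'(n)$, we obtain

\[
\sum_{n=1}^{N} I[J_n(i) \cap A_n]
\le \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} + 1
+ \sum_{n=1}^{N} I\!\left[ J_n(i) \cap A_n \cap T_i'(n) \ge \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} \right].
\]

Taking the expected value we have

\[
\begin{aligned}
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap A_n] \right]
&\le \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} + 1
+ \sum_{n=1}^{N} P\!\left[ J_n(i) \cap A_n \cap T_i'(n) \ge \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} \right] \\
&\le \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} + 1
+ N \exp\!\left( -\frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} \cdot \frac{D_{\min}(F_i, \mu^*)}{1 + \epsilon/2} \right) \qquad \text{(by (5))} \\
&= \frac{(1 + \epsilon) \log N}{D_{\min}(F_i, \mu^*)} + 1 + N^{-\frac{1 + \epsilon}{1 + \epsilon/2} + 1}.
\end{aligned}
\]

The lemma is proved since $N^{-\frac{1 + \epsilon}{1 + \epsilon/2} + 1} = o(1)$. □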

Proof of Lemma 9. First we have

\[
\sum_{n=1}^{N} I[J_n(i) \cap C_n] \le \sum_{n=1}^{N} I[J_n \in L \cap C_n]
\le \sum_{t=1}^{N} \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n]. \tag{17}
\]

From the technique of types [7, Lemma 2.1.9], it holds for any type $Q \in \mathcal{A}$ that

\[
P_{F_1}[\hat F_{1,t} = Q] \le \exp(-t D(Q \| F_1)) \le \exp(-t D_{\min}(Q, \mu_1)). \tag{18}
\]

Let $R = (R_1, \dots, R_m)$ be the smallest $m$ integers in $\{n : T_1'(n) = t \cap C_n\}$. $R$ is well
defined on the event $m \le \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n]$. Let $r = (r_1, \dots, r_m) \in \mathbb{N}^m$
be a realization of $R$. Here recall that we write an event e.g. "$\cdots \cap R = r \cap \hat F_{1,t} = Q$"
instead of "$\cdots \cap \{R = r\} \cap \{\hat F_{1,t} = Q\}$". Then we obtain for any $r$ that



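\[
\begin{aligned}
& P\!\left[ \left\{ \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n] \ge m \right\} \cap R = r \cap \hat F_{1,t} = Q \right] \\
&\quad \le P\!\left[ \bigcap_{l=1}^{m} \{J_{r_l} \in L\} \cap R = r \cap \hat F_{1,t} = Q \right] \\
&\quad = P_{F_1}[\hat F_{1,t} = Q] \prod_{l=1}^{m} \Bigg( P\!\left[ R_l = r_l \,\Bigg|\, \bigcap_{k=1}^{l-1} \{J_{r_k} \in L \cap R_k = r_k\} \cap \hat F_{1,t} = Q \right] \\
&\qquad\qquad \times P\!\left[ J_{r_l} \in L \,\Bigg|\, R_l = r_l \cap \bigcap_{k=1}^{l-1} \{J_{r_k} \in L \cap R_k = r_k\} \cap \hat F_{1,t} = Q \right] \Bigg) \\
&\quad \le P_{F_1}[\hat F_{1,t} = Q] \prod_{l=1}^{m} \Bigg( P\!\left[ R_l = r_l \,\Bigg|\, \bigcap_{k=1}^{l-1} \{J_{r_k} \in L \cap R_k = r_k\} \cap \hat F_{1,t} = Q \right] \\
&\qquad\qquad \times \left( 1 - \frac{1}{K} \exp(-t D_{\min}(Q, \mu_1 - \delta)) \right) \Bigg)
\qquad \text{(by (5) and } \hat\mu^*(R_l) < \mu_1 - \delta) \\
&\quad = P_{F_1}[\hat F_{1,t} = Q] \left( 1 - \frac{1}{K} \exp(-t D_{\min}(Q, \mu_1 - \delta)) \right)^{m} \\
&\qquad\qquad \times \prod_{l=1}^{m} P\!\left[ R_l = r_l \,\Bigg|\, \bigcap_{k=1}^{l-1} \{J_{r_k} \in L \cap R_k = r_k\} \cap \hat F_{1,t} = Q \right].
\end{aligned}
\]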



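By taking the disjoint union of $r$, we have

\[
P\!\left[ \left\{ \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n] \ge m \right\} \cap \hat F_{1,t} = Q \right]
\le P_{F_1}[\hat F_{1,t} = Q] \left( 1 - \frac{1}{K} \exp(-t D_{\min}(Q, \mu_1 - \delta)) \right)^{m}. \tag{19}
\]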



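Then we have

\[
\begin{aligned}
& \mathrm{E}\!\left[ \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n] \right] \\
&\quad = \sum_{Q : \mathrm{E}(Q) < \mu_1 - \delta} \sum_{m=1}^{\infty}
P\!\left[ \left\{ \sum_{n=1}^{\infty} I[J_n \in L \cap T_1'(n) = t \cap C_n] \ge m \right\} \cap \hat F_{1,t} = Q \right] \\
&\quad \le \sum_{Q : \mathrm{E}(Q) < \mu_1 - \delta} \sum_{m=1}^{\infty}
\exp(-t D_{\min}(Q, \mu_1)) \left( 1 - \frac{1}{K} \exp(-t D_{\min}(Q, \mu_1 - \delta)) \right)^{m}
\qquad \text{(by (18) and (19))} \\
&\quad \le K \sum_{Q : \mathrm{E}(Q) < \mu_1 - \delta} \exp\big( -t ( D_{\min}(Q, \mu_1) - D_{\min}(Q, \mu_1 - \delta) ) \big) \\
&\quad \le K \sum_{Q : \mathrm{E}(Q) < \mu_1 - \delta} \exp(-t\, C(\mu_1, \mu_1 - \delta))
\qquad \text{(by Lemma 7)} \\
&\quad \le K (t + 1)^{|\mathrm{supp}(F_1)|} \exp(-t\, C(\mu_1, \mu_1 - \delta)).
\end{aligned}
\tag{20}
\]

The last inequality holds since there are at most $(t + 1)^{|\mathrm{supp}(F_1)|}$ combinations as a type of
$t$ samples from $F_1$.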
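The polynomial bound on the number of types can be checked numerically. The sketch below counts types by the standard stars-and-bars formula; the function name `number_of_types` is illustrative, and `s` stands for $|\mathrm{supp}(F_1)|$.

```python
from math import comb


def number_of_types(t, s):
    """Number of distinct empirical distributions of t samples over s symbols.

    A type is a vector of s nonnegative counts summing to t, so the count is
    the stars-and-bars number C(t + s - 1, s - 1).
    """
    return comb(t + s - 1, s - 1)


# Each of the s counts lies in {0, ..., t}, hence at most (t + 1)^s types.
for t in (1, 5, 50):
    for s in (1, 2, 3, 6):
        assert number_of_types(t, s) <= (t + 1) ** s
```

Since the number of types grows only polynomially in $t$ while each type has exponentially small probability by (18), the sum over types in (20) remains exponentially small.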

Finally we obtain from (17), (20) and $C(\mu_1, \mu_1 - \delta) > 0$ that

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap C_n] \right]
\le \sum_{t=1}^{N} K (t + 1)^{|\mathrm{supp}(F_1)|} \exp(-t\, C(\mu_1, \mu_1 - \delta)) = O(1)
\]

and the proof is completed. □

In the proofs of the remaining two lemmas, we use [7, Theorem 6.2.10] on the empirical
distribution:

Theorem 12 (Sanov's Theorem). For every closed set $\Gamma$ of probability distributions

\[
\limsup_{t \to \infty} \frac{1}{t} \log P_F[\hat F_t \in \Gamma] \le -\inf_{G \in \Gamma} D(G \| F),
\]

where $\hat F_t$ is the empirical distribution of $t$ samples from $F$.

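Proof of Lemma 10. We apply Sanov's Theorem with $F = F_i$ and

\[
\Gamma = \{G \in \mathcal{A} : L(F_i, G) \ge \delta_1\}
\]

where $\delta_1 > 0$ is a constant. Since $\inf_{G \in \Gamma} D(G \| F_i) > 0$, there exists a constant $C_1 > 0$
such that

\[
P_{F_i}[\hat F_{i,t} \in \Gamma] \le \exp(-C_1 t) \tag{21}
\]

for sufficiently large $t$.

Now we show

\[
\{A_n^c \cap B_n\} \subset \{\hat F_i(n) \in \Gamma\} \tag{22}
\]

or equivalently $\{\hat F_i(n) \notin \Gamma \cap B_n\} \subset A_n$ for sufficiently small $\delta_1$. If $\hat F_i(n) \notin \Gamma$ and $B_n$ hold,
then

\[
D_{\min}(\hat F_i(n), \hat\mu^*(n)) \ge D_{\min}(\hat F_i(n), \mu^* - \delta)
\]

from $\hat\mu^*(n) \ge \hat\mu_1(n) \ge \mu_1 - \delta = \mu^* - \delta$ and the monotonicity of $D_{\min}$ in $\mu$. Since
$D_{\min}(F_i, \mu^* - \delta) > 0$, for sufficiently small $\delta_1$ we obtain

\[
D_{\min}(\hat F_i(n), \mu^* - \delta) \ge \frac{D_{\min}(F_i, \mu^* - \delta)}{1 + \epsilon/3}
\]

from the lower semicontinuity in $F$ of $D_{\min}$ in Lemma 6. Moreover, from the continuity
of $D_{\min}$ in $\mu$, it holds for sufficiently small $\delta$ that

\[
\frac{D_{\min}(F_i, \mu^* - \delta)}{1 + \epsilon/3} \ge \frac{D_{\min}(F_i, \mu^*)}{1 + \epsilon/2}.
\]

Then $A_n$ holds and (22) is proved.

From (22) we obtain

\[
\begin{aligned}
\left\{ \sum_{n=1}^{N} I[J_n(i) \cap A_n^c \cap B_n] \ge m \right\}
&\subset \left\{ \sum_{n=1}^{N} I[J_n(i) \cap \hat F_i(n) \in \Gamma] \ge m \right\} \\
&= \left\{ \sum_{t=1}^{N} I\!\left[ \bigcup_{n=1}^{N} \{J_n(i) \cap T_i'(n) = t \cap \hat F_{i,t} \in \Gamma\} \right] \ge m \right\} \\
&\subset \bigcup_{l=m}^{N} \{\hat F_{i,l} \in \Gamma\}.
\end{aligned}
\tag{23}
\]

(23) follows because there is at most one $n$ such that $J_n(i) \cap T_i'(n) = t$.
Finally, from (21) and (23) we obtain

\[
\begin{aligned}
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap A_n^c \cap B_n] \right]
&= \sum_{m=1}^{N} P\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap A_n^c \cap B_n] \ge m \right] \\
&\le \sum_{m=1}^{N} \sum_{l=m}^{N} P_{F_i}[\hat F_{i,l} \in \Gamma] \\
&= O(1).
\end{aligned}
\tag{24}
\]

□

Proof of Lemma 11. First we simply bound $\sum_{n=1}^{N} I[J_n(i) \cap D_n]$ by

\[
\sum_{n=1}^{N} I[J_n(i) \cap D_n] \le \sum_{n=1}^{\infty} I[D_n].
\]

Since $D_n \subset \bigcup_{k \in L} \{\hat\mu_k(n) = \hat\mu^*(n) \ge \mu_1 - \delta\}$, it holds that

\[
\begin{aligned}
\sum_{n=1}^{\infty} I[D_n]
&\le \sum_{k \in L} \sum_{n=1}^{\infty} I[\hat\mu_k(n) = \hat\mu^*(n) \ge \mu_1 - \delta] \\
&= \sum_{k \in L} \sum_{t=1}^{\infty} \sum_{n=1}^{\infty} I[T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta].
\end{aligned}
\tag{25}
\]
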
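Now we use a reasoning similar to (19). Let $R = (R_1, \dots, R_m)$ be the smallest $m$
integers in $\{n : T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta\}$. $R$ is well defined on the event
$m \le \sum_{n=1}^{\infty} I[T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta]$. Then we have

\[
\begin{aligned}
& P\!\left[ \sum_{n=1}^{\infty} I[T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta] \ge m \right] \\
&\quad \le P\!\left[ \hat\mu_{k,t} \ge \mu_1 - \delta \ \cap \bigcap_{l=1}^{m-1} \{J_{R_l} \ne k\} \right] \\
&\quad = P_{F_k}[\hat\mu_{k,t} \ge \mu_1 - \delta]\;
P\!\left[ \bigcap_{l=1}^{m-1} \{J_{R_l} \ne k\} \,\Bigg|\, \hat\mu_{k,t} \ge \mu_1 - \delta \right] \\
&\quad \le P_{F_k}[\hat\mu_{k,t} \ge \mu_1 - \delta] \left( 1 - \frac{1}{K} \right)^{m-1}
\end{aligned}
\]

from $\hat\mu_k(R_l) = \hat\mu^*(R_l)$ and (4). Therefore we obtain

\[
\begin{aligned}
\mathrm{E}\!\left[ \sum_{n=1}^{\infty} I[T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta] \right]
&= \sum_{m=1}^{\infty} P\!\left[ \sum_{n=1}^{\infty} I[T_k'(n) = t \cap \hat\mu_{k,t} = \hat\mu^*(n) \ge \mu_1 - \delta] \ge m \right] \\
&\le K\, P_{F_k}[\hat\mu_{k,t} \ge \mu_1 - \delta].
\end{aligned}
\tag{26}
\]

On the other hand, it holds from Sanov's theorem that for a constant $C_2 > 0$

\[
P_{F_k}[\hat\mu_{k,t} \ge \mu_1 - \delta] = O(\exp(-C_2 t)) \tag{27}
\]

by setting $F = F_k$ and $\Gamma = \{G \in \mathcal{A} : \mathrm{E}(G) \ge \mu_1 - \delta\}$. From (25), (26) and (27), we
obtain

\[
\mathrm{E}\!\left[ \sum_{n=1}^{N} I[J_n(i) \cap D_n] \right]
\le \sum_{k \in L} \sum_{t=1}^{\infty} K \cdot O(\exp(-C_2 t)) = O(1).
\]

□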



Proof of Theorem 5. (i) $h''(\nu) = 0$ holds only for the degenerate case that $f_i = 1$ at $x_i = \mu$
and this case does not satisfy the assumption $\mathrm{E}(F) < \mu$. Therefore $h''(\nu) < 0$ and $h(\nu)$ is
strictly concave. $\nu^*$ is unique from the strict concavity.

Now we show (11), (12) and (13) by the technique of Lagrange multipliers. The
Lagrangian function for (7) is written as

\[
\sum_{i=1}^{M} f_i \log \frac{f_i}{g_i} - \sum_{i=1}^{M} \lambda_i g_i
+ \nu \left( \mu - \sum_{i=1}^{M} x_i g_i \right) + \xi \left( \sum_{i=1}^{M} g_i - 1 \right).
\]

Then there exists a Kuhn-Tucker vector $(\lambda_1^*, \dots, \lambda_M^*, \nu^*, \xi^*)$ for the problem (7) from [17,
Theorem 28.2]. On the other hand it is obvious that the problem (7) has an optimal
solution $G^* = (g_1^*, \dots, g_M^*)$. From [17, Theorem 28.3], $(g_1^*, \dots, g_M^*)$ is an optimal solution
and $(\lambda_1^*, \dots, \lambda_M^*, \nu^*, \xi^*)$ is a Kuhn-Tucker vector if and only if the following Kuhn-Tucker
conditions are satisfied:

\[
\begin{aligned}
& -\frac{f_i}{g_i^*} - \lambda_i^* - x_i \nu^* + \xi^* = 0, \quad \forall i, \\
& g_i^* \ge 0, \quad \lambda_i^* \ge 0, \quad \lambda_i^* g_i^* = 0, \quad \forall i, \\
& \sum_{i=1}^{M} x_i g_i^* \ge \mu, \quad \nu^* \ge 0, \quad \nu^* \left( \mu - \sum_{i=1}^{M} x_i g_i^* \right) = 0, \\
& \sum_{i=1}^{M} g_i^* = 1.
\end{aligned}
\]
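First we consider the case $\mathrm{E}_F[\mu/X] \le 1$. In this case, it is easily checked that

\[
g_i^* =
\begin{cases}
1 - \mathrm{E}_F[\mu/X] & i = 1, \\
\dfrac{\mu f_i}{x_i} & i \ne 1,
\end{cases}
\]

$\lambda_i^* = 0$, $\nu^* = -\frac{1}{\mu}$ and $\xi^* = 0$ satisfy the Kuhn-Tucker conditions since $f_1 = 0$ and $f_i > 0$
for $i \ne 1$. Therefore (12) is obtained. (11) follows from $h'(-1/\mu) \ge 0$ and the concavity
of $h(\nu)$.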

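Now we consider the second case $\mathrm{E}_F[\mu/X] > 1$. Since $h'(0) > 0$, $h'(-1/\mu) < 0$ and
$h(\nu)$ is concave,

\[
\max_{0 \le \nu \le -\frac{1}{\mu}} h(\nu) = \max_{\nu} h(\nu) \tag{28}
\]

holds and $\nu^* = \operatorname{argmax}_{0 \le \nu \le -1/\mu} h(\nu)$ satisfies

\[
h'(\nu^*) = -\sum_{i=1}^{M} \frac{f_i (x_i - \mu)}{1 - (x_i - \mu)\nu^*} = 0. \tag{29}
\]

From (29) we obtain

\[
\sum_{i=1}^{M} \frac{f_i}{1 - (x_i - \mu)\nu^*}
= \sum_{i=1}^{M} \frac{f_i (1 - (x_i - \mu)\nu^*)}{1 - (x_i - \mu)\nu^*}
+ \nu^* \sum_{i=1}^{M} \frac{f_i (x_i - \mu)}{1 - (x_i - \mu)\nu^*} = 1 \tag{30}
\]

and

\[
\sum_{i=1}^{M} \frac{x_i f_i}{1 - (x_i - \mu)\nu^*}
= \sum_{i=1}^{M} \frac{f_i (x_i - \mu)}{1 - (x_i - \mu)\nu^*}
+ \mu \sum_{i=1}^{M} \frac{f_i}{1 - (x_i - \mu)\nu^*} = \mu. \tag{31}
\]

From (30) and (31), it is easily checked that

\[
g_i^* = \frac{f_i}{1 - (x_i - \mu)\nu^*}, \qquad
\lambda_i^* =
\begin{cases}
0 & f_i > 0, \\
1 - (x_i - \mu)\nu^* & f_i = 0,
\end{cases}
\]

$\xi^* = 1 + \mu \nu^*$ and $\nu^*$ satisfy the Kuhn-Tucker conditions and (11) is obtained. (13) follows
immediately from (28).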

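(ii) The claim is obviously true for the case $\mathrm{E}_F[\mu/X] \le 1$ and we consider the case
$\mathrm{E}_F[\mu/X] > 1$.

Define

\[
w(x, \nu) = \frac{x - \mu}{1 - (x - \mu)\nu}.
\]

For any fixed $\nu \in [0, -1/\mu)$, $w(x, \nu)$ is convex in $x \in [-1, 0]$. Therefore

\[
\begin{aligned}
h'(\nu) &= -\sum_{i=1}^{M} f_i w(x_i, \nu) \\
&\ge -\sum_{i=1}^{M} f_i \big( -x_i w(-1, \nu) + (1 + x_i) w(0, \nu) \big) \\
&= \mathrm{E}(F) w(-1, \nu) - (1 + \mathrm{E}(F)) w(0, \nu).
\end{aligned}
\tag{32}
\]

The right-hand side of (32) is $0$ for $\nu = (\mu - \mathrm{E}(F))/(-\mu(1 + \mu))$ and therefore

\[
h'\!\left( \frac{\mu - \mathrm{E}(F)}{-\mu(1 + \mu)} \right) \ge 0.
\]

Since $h'(\nu)$ is monotonically decreasing, $\nu^* \ge (\mu - \mathrm{E}(F))/(-\mu(1 + \mu))$ is proved.

(iii) It is obvious that $\frac{\partial}{\partial \mu} D_{\min}(F, \mu) = \nu^* = -\frac{1}{\mu}$ for $\mathrm{E}_F[\mu/X] < 1$ and

\[
\lim_{\epsilon \downarrow 0} \frac{D_{\min}(F, \mu + \epsilon) - D_{\min}(F, \mu)}{\epsilon} = -\frac{1}{\mu}
\]

for $\mathrm{E}_F[\mu/X] = 1$.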

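Define $D'_{\min}(F, \mu) = \max_{\nu} h(\nu)$. Then $D_{\min}(F, \mu) = D'_{\min}(F, \mu)$ for the case
$\mathrm{E}_F[\mu/X] \ge 1$. From Corollary 3.4.3], $D'_{\min}(F, \mu)$ is differentiable in $\mu$ with

\[
\frac{\partial}{\partial \mu} D'_{\min}(F, \mu) = \left. \frac{\partial}{\partial \mu} h(\nu) \right|_{\nu = \nu^*} = \nu^*.
\]

Therefore we obtain

\[
\frac{\partial}{\partial \mu} D_{\min}(F, \mu) = \frac{\partial}{\partial \mu} D'_{\min}(F, \mu) = \nu^*
\]

for $\mathrm{E}_F[\mu/X] > 1$ and

\[
\lim_{\epsilon \downarrow 0} \frac{D_{\min}(F, \mu - \epsilon) - D_{\min}(F, \mu)}{-\epsilon}
= \lim_{\epsilon \downarrow 0} \frac{D'_{\min}(F, \mu - \epsilon) - D'_{\min}(F, \mu)}{-\epsilon}
= \nu^*
\]

for $\mathrm{E}_F[\mu/X] = 1$. □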



4 Experiments 

In this section, we present some simulation results on our MED policy and the UCB policies in [3].

First we give an algorithm for computing $\nu^*$ and $D_{\min}(F, \mu)$ with parameters $r, \nu_0$,
which we denote by $D_{\min}(F, \mu; r, \nu_0)$. Here $r$ is a repetition number and $\nu_0$ is an initial
value of $\nu$ for the optimization in Theorem 5. Recall that $h, h', h''$ are defined in (8), (9)
and (10).

[Computation of Dmin(-F, /i; r, i/q)] 
Require: r > 0, uq > 0; 

if /i = Oand^Ei^i J<lthen 



end if 



if 1^0 £ (^) ^) then 

u := z/q; 
end if 

for t := 1 to r do 
if h'{u) > then 

u := u; 
else 

z7 := u] 
end if 

■= U -h'{u)/h"{uy, 

if z/ ^ (z/, z7) then 

z/ := 
end if 
end for 

return [max^>(,{j,^-p^^} ^u'), argmax^,g|j,_j7;,| /i(z/')); 

In this algorithm, a lower and an upper bound of $\nu^*$ are given by $\underline{\nu}$ and $\overline{\nu}$, respectively. In 
each step, the next point is determined based on Newton's method by $\nu := \nu - h'(\nu)/h''(\nu)$. 
When $\nu$ does not improve the bounds $\underline{\nu}$, $\overline{\nu}$, the next point is determined by the bisection 
method, $\nu := (\underline{\nu} + \overline{\nu})/2$. The complexity of the algorithm is $O(r\,|\mathrm{supp}(F)|)$. 
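
The safeguarded Newton iteration described above might be sketched as follows. This is not the authors' code: it assumes $h(\nu) = E_F[\log(1 - (X - \mu)\nu)]$, a support in $[-1, 0]$ and $-1 < \mu < 0$, and it fills in details the text does not state (the initial bounds, taken from Theorem 5, the midpoint as default starting point, and the early returns). The name `d_min` and its argument layout are hypothetical.

```python
import math


def d_min(xs, fs, mu, r, nu0):
    """Approximate (D_min(F, mu), nu*) for a finite distribution on [-1, 0].

    xs, fs: support points and their probabilities; requires -1 < mu < 0.
    """
    def h(nu):
        total = 0.0
        for x, f in zip(xs, fs):
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                return -math.inf           # log of a non-positive number
            total += f * math.log(arg)
        return total

    def h1(nu):                            # first derivative h'
        return -sum(f * (x - mu) / (1.0 - (x - mu) * nu) for x, f in zip(xs, fs))

    def h2(nu):                            # second derivative h''
        return -sum(f * ((x - mu) / (1.0 - (x - mu) * nu)) ** 2 for x, f in zip(xs, fs))

    mean = sum(f * x for x, f in zip(xs, fs))
    if mu <= mean:                         # assumption: F itself is feasible
        return 0.0, 0.0
    # E_F[mu / X]; any mass at x = 0 makes it infinite.
    ratio = sum(f * (mu / x if x < 0.0 else math.inf)
                for x, f in zip(xs, fs) if f > 0.0)
    if ratio <= 1.0:                       # boundary case nu* = -1/mu
        value = sum(f * math.log(x / mu) for x, f in zip(xs, fs) if f > 0.0)
        return value, -1.0 / mu

    lo = (mu - mean) / (-mu * (1.0 + mu))  # lower bound from Theorem 5 (ii)
    hi = -1.0 / mu                         # assumed upper bound on nu*
    nu = nu0 if lo < nu0 < hi else 0.5 * (lo + hi)  # assumed default start
    for _ in range(r):
        if h1(nu) > 0.0:
            lo = nu
        else:
            hi = nu
        nu = nu - h1(nu) / h2(nu)          # Newton step
        if not (lo < nu < hi):
            nu = 0.5 * (lo + hi)           # bisection fallback
    best = max((lo, nu, hi), key=h)
    return h(best), best


# Arm 2 of Distribution 1 after the shift by -1: support {-1, 0}, mu = 0.55 - 1.
value, nu_star = d_min([-1.0, 0.0], [0.55, 0.45], -0.45, r=0, nu0=0.0)
print(value, 0.1 * math.log(0.55 / 0.45))  # the two numbers agree up to rounding
```

Returning the best of the three candidate points means that a two-point support is solved exactly even with `r = 0`, because the lower bound then coincides with $\nu^*$.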



return ( h 
The complexity $O(r\,|\mathrm{supp}(F)|)$ is not very small when $|\mathrm{supp}(F)|$ is large. In particular, 
it requires $O(r T_i(n))$ ($\approx O(r \log n)$) computations when it is adopted for a continuous 
support model, since $|\mathrm{supp}(\hat F_{j,t})| \le t$. On the other hand, $D_{\min}(F, \mu)$ is differentiable in 
$\mu$ (with slope $\nu^*$) and the argument $\mu$ converges to $\mu^*$ after sufficiently many rounds. 
Therefore it is reasonable to approximate $D_{\min}(F, \mu)$ by a past value of $D_{\min}(F, \mu; \nu_0, r)$ 
as long as the variation of $\mu$ is small. From this point of view, we implemented our MED policy 
for our simulations in the following way: 

[An implementation of MED policy] 
Parameter: Integer $r > 0$ and real $d > 0$. 
Initialization: 

1. Pull each arm once. 

2. Set $(D_i, \nu_i) := D_{\min}(\hat F_i(K+1), \hat\mu^*(K+1); 0, r)$ and $m_i := \hat\mu^*(K+1)$ for each $i$. 

Loop: For the $n$-th round, 

1. Update variables for each $i$: 

   • If $J_{n-1} \ne i$ and $|\hat\mu^*(n) - m_i| < d$ then $D_i := D_i + \nu_i(\hat\mu^*(n) - m_i)$. 

   • Otherwise $(D_i, \nu_i) := D_{\min}(\hat F_i(n), \hat\mu^*(n); \nu_i, r)$ and $m_i := \hat\mu^*(n)$. 

2. Choose arm $\Pi_j$ according to the probability 

   $$p_j(n) = \frac{\exp(-T_j(n) D_j)}{\sum_{i=1}^{K} \exp(-T_i(n) D_i)}.$$
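
The two steps of the loop might be sketched as below. This is an illustrative sketch rather than the authors' implementation: it assumes the value computed at the reference point $m_i$ is stored separately, so that the linear correction is always applied to that stored value, and it takes the exact computation as a caller-supplied function. The names `selection_probabilities`, `current_divergence`, `recompute` and the toy divergence are hypothetical.

```python
import math


def selection_probabilities(counts, divergences):
    """p_j(n) = exp(-T_j(n) D_j) / sum_i exp(-T_i(n) D_i)."""
    weights = [math.exp(-t * d) for t, d in zip(counts, divergences)]
    total = sum(weights)
    return [w / total for w in weights]


def current_divergence(state, pulled_last_round, mu_star, d, recompute):
    """Return the value of D_i to use in this round.

    state: dict with 'D' (value at the reference point), 'nu' (slope) and
    'm' (reference point). recompute(mu_star, nu_init) -> (D, nu) stands in
    for D_min(F_i(n), mu*(n); nu_i, r).
    """
    if not pulled_last_round and abs(mu_star - state["m"]) < d:
        # Linear approximation around the stored reference point.
        return state["D"] + state["nu"] * (mu_star - state["m"])
    state["D"], state["nu"] = recompute(mu_star, state["nu"])
    state["m"] = mu_star
    return state["D"]


def toy(mu, nu_init):
    """Toy stand-in for the exact computation: D(mu) = (mu + 0.5)^2."""
    return (mu + 0.5) ** 2, 2.0 * (mu + 0.5)


state = {"D": 0.04, "nu": 0.4, "m": -0.3}
print(current_divergence(state, False, -0.295, 0.01, toy))  # linear update
print(current_divergence(state, False, -0.2, 0.01, toy))    # full recomputation
print(selection_probabilities([10, 10], [0.0, 0.1]))
```

An arm can then be drawn with, for example, `random.choices(range(K), weights=probabilities)`.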

Now we describe the setting of our experiments. We used MED, UCB-tuned and 
UCB2. Each plot is an average over 1,000 different runs. The parameter $\alpha$ for UCB2 is 
set to 0.001, the choice of which is not very important for the performance (see [3]). First 
we check the effect of the choice of the parameters $r$ and $d$. Then MED and UCB policies 
are compared. 

In the following simulations, we use the model where the support is included in $[0, 1]$. 
Note that in the computation of $D_{\min}(F, \mu; \nu, r)$ we assumed that the support is included 
in $[-1, 0]$ for computational convenience. Therefore, in MED all rewards are passed to the 
computation after 1 is subtracted from them. 

Table 1 gives the list of distributions used in the experiments. They cover various 
situations regarding the computation of $D_{\min}$ and how distinguishable the optimal arm 
is. Distributions 1-4 are examples of 2-armed bandit problems. In Distribution 1, 
$\nu^* \ge (\mu - E(F))/(-\mu(1 + \mu))$ in Theorem 5 always holds with equality since $\mathrm{supp}(F_j) \subset \{0, 1\}$. 
Therefore the exact solution can be obtained by $D_{\min}(F, \mu; \nu, r)$ regardless of $r$. Also 
in Distribution 2, $D_{\min}(F, \mu; \nu, r)$ does not require the repetition after sufficiently many 
rounds since $E_{F_2}[\mu_1/X] < 1$. On the other hand, in Distribution 3 the maximization in Theorem 5 
is necessary in almost all rounds since $E_{F_2}[\mu_1/X] > 1$. Distribution 4 is an example of 
a difficult problem where the optimal arm is hard to distinguish since the inferior arm 
appears to be optimal at first with high probability. Distributions 5 and 6 are examples 
of more general problems where the number of arms $K$ and the support sizes are large. 
$\mathrm{Be}(\alpha, \beta)$ $(\alpha, \beta > 0)$ in Distribution 6 denotes the beta distribution, which has the density 
function 

$$\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } x \in [0, 1],$$

where $B(\alpha, \beta)$ is the beta function. Note that beta distributions have continuous support and 
are not included in $\mathcal{A}$, and therefore the performance of MED is not assured theoretically. 
However, MED is still formally applicable since the supports are bounded. 

Table 1: Distributions for experiments. 

Distribution 1: 
    $F_1(\{0\}) = 0.45$, $F_1(\{1\}) = 0.55$                                      $E(F_1) = 0.55$ 
    $F_2(\{0\}) = 0.55$, $F_2(\{1\}) = 0.45$                                      $E(F_2) = 0.45$ 
Distribution 2: 
    $F_1(\{0.4\}) = 0.5$, $F_1(\{0.8\}) = 0.5$                                    $E(F_1) = 0.6$ 
                                                                                  $E(F_2) = 0.4$ 
Distribution 3: 
    $F_1(\{x\}) = 0.08$ for $x = 0, 0.1, \ldots, 0.9$, $F_1(\{1\}) = 0.2$         $E(F_1) = 0.56$ 
    $F_2(\{x\}) = 1/11$ for $x = 0, 0.1, \ldots, 0.9, 1$                          $E(F_2) = 0.5$ 
Distribution 4: 
    $F_1(\{0\}) = 0.99$, $F_1(\{1\}) = 0.01$                                      $E(F_1) = 0.01$ 
    $F_2(\{0.008\}) = 0.5$, $F_2(\{0.009\}) = 0.5$                                $E(F_2) = 0.0085$ 
Distribution 5: 
    $F_1(\{x\}) = 0.08$ for $x = 0, 0.1, \ldots, 0.9$, $F_1(\{1\}) = 0.2$         $E(F_1) = 0.56$ 
    $F_i(\{x\}) = 1/11$ for $x = 0, 0.1, \ldots, 0.9, 1$                          $E(F_i) = 0.5$ for $i = 2, 3, 4, 5$ 
Distribution 6: 
    $F_1 = \mathrm{Be}(0.9, 0.1)$                                                 $E(F_1) = 0.9$ 
    $F_2 = \mathrm{Be}(7, 3)$                                                     $E(F_2) = 0.7$ 
    $F_3 = \mathrm{Be}(0.5, 0.5)$                                                 $E(F_3) = 0.5$ 
    $F_4 = \mathrm{Be}(3, 7)$                                                     $E(F_4) = 0.3$ 
    $F_5 = \mathrm{Be}(0.1, 0.9)$                                                 $E(F_5) = 0.1$ 
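
The expectations listed for the experimental distributions can be checked with exact arithmetic. This is a small sketch; the variable names are hypothetical, and the uniform weight 1/11 on the eleven grid points is an assumption chosen to be consistent with the listed expectation of 0.5.

```python
from fractions import Fraction as Fr


def mean(dist):
    """Expectation of a finite distribution given as {point: probability}."""
    return sum(x * p for x, p in dist.items())


grid = [Fr(k, 10) for k in range(11)]  # 0, 0.1, ..., 1

dist1 = [{Fr(0): Fr(45, 100), Fr(1): Fr(55, 100)},
         {Fr(0): Fr(55, 100), Fr(1): Fr(45, 100)}]
dist2_arm1 = {Fr(4, 10): Fr(1, 2), Fr(8, 10): Fr(1, 2)}
# Arm 1 of Distributions 3 and 5: weight 0.08 on 0, ..., 0.9 and 0.2 on 1.
skewed = {**{x: Fr(8, 100) for x in grid[:-1]}, Fr(1): Fr(2, 10)}
# Assumed uniform weights on the eleven grid points.
uniform = {x: Fr(1, 11) for x in grid}
dist4 = [{Fr(0): Fr(99, 100), Fr(1): Fr(1, 100)},
         {Fr(8, 1000): Fr(1, 2), Fr(9, 1000): Fr(1, 2)}]
beta_params = [(0.9, 0.1), (7, 3), (0.5, 0.5), (3, 7), (0.1, 0.9)]

print([float(mean(f)) for f in dist1], float(mean(dist2_arm1)))
print(float(mean(skewed)), float(mean(uniform)))
print([float(mean(f)) for f in dist4])
print([a / (a + b) for a, b in beta_params])  # mean of Be(a, b) is a / (a + b)
```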

The labels of each figure are as follows. "regret" denotes $\sum_{i:\mu_i<\mu^*}(\mu^* - \mu_i) T_i(n)$, which 
is the loss due to choosing suboptimal arms. "% best arm played" is the percentage of rounds in 
which the best arm is chosen, that is, $100 \times T_1(n)/n$ in these problems. "Dmin" stands for 
the asymptotic bound for a consistent policy, $\sum_{i:\mu_i<\mu^*}(\mu^* - \mu_i)\log n / D_{\min}(F_i, \mu^*)$. The 
asymptotic slope of the regret (in the semi-logarithmic plot) of a consistent policy is greater 
than or equal to that of "Dmin". 

Figure 1: Comparison between different parameters of MED. 

Figure 1 shows an experiment on the choice of the parameters $r$ and $d$ of MED for 
Distribution 3. Our implementation of MED approaches the ideal MED as $d \to 0$ and 
$r \to \infty$. However, we see from the figure that the performance is not sensitive to the 
choice of $r$, $d$. This may be understood as follows: (1) the linear approximation for the 
case $|\hat\mu^*(n) - m_i| < d$ is accurate, (2) the initial value $\nu_i$ in $D_{\min}(\hat F_i(n), \hat\mu^*(n); \nu_i, r)$ seems 
to be a good approximation of $\nu^*$ and the repetition number does not have to be large. 
We use $r = 2$ and $d = 0.01$ in the remaining experiments based on this result. 

Now we summarize the remaining experiments on the comparison of the policies 
(Figures 2-7). 

• MED always seems to achieve the asymptotic bound, even for continuous support 
distributions, since the asymptotic slope of the regret is close to that of "Dmin". 

• MED performs best except for Distribution 1, where MED performs worst. However, 
unlike MED and UCB2, the consistency of UCB-tuned has not been proved. It appears that 
UCB-tuned might not be consistent, because the asymptotic slope of $T_2(n)$ seems 
to be smaller than that of "Dmin". Note that the theoretical logarithmic terms of 
the regret of MED and UCB2 are very close for Distribution 1 ($4.983 \log n$ and 
$5.025 \log n$, respectively). Therefore this result can be interpreted as follows: MED 
achieves the asymptotic bound but needs some improvement in the constant term 
of the regret compared to UCB2. 
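
The coefficient 4.983 quoted for MED on Distribution 1 can be reproduced by hand. The sketch below assumes that, for a support contained in $\{0, 1\}$, $D_{\min}(F_2, \mu^*)$ reduces to the Kullback-Leibler divergence between Bernoulli distributions with means 0.45 and 0.55; the function name `bernoulli_kl` is hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


mu_star, mu_2 = 0.55, 0.45
# Coefficient of log n in the bound (mu* - mu_2) log n / D_min(F_2, mu*).
coefficient = (mu_star - mu_2) / bernoulli_kl(mu_2, mu_star)
print(round(coefficient, 3))  # 4.983
```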
Figure 5: Simulation result for Distribution 4 (very confusing distributions). 

Figure 6: Simulation result for Distribution 5 (5 arms with a wide support). 

Figure 7: Simulation result for Distribution 6 (beta distributions). 

[Plots not reproduced. Legend: MED, UCB-tuned, UCB2, Dmin. Horizontal axis: plays, from 10 to 100000 on a logarithmic scale.] 
5 Concluding remarks 



We proposed a policy, MED, and proved that our policy achieves the asymptotic bound 
for finite support models. We also showed that our policy can be implemented efficiently 
by a convex optimization technique. 

In the theoretical analysis of this paper, we assumed the finiteness of the support, 
although MED also worked well for distributions with continuous bounded support in 
the simulations. We conjecture that the optimality of MED holds also for the continuous 
bounded support model. In addition, there are many models for which $D_{\min}$ can be computed 
explicitly, such as the normal distribution model with unknown mean and variance. We expect 
that our MED can be extended to these models. Furthermore, our MED is a randomized 
policy and the theoretical evaluation of the expectation includes randomization in the 
policy. We may be able to construct a deterministic version of MED. 

In addition to the above theoretical analyses, it is also important to consider the finite 
horizon case. Then it is necessary to derive a finite-time bound of MED for this case. 
In particular, the MED policy itself should be improved when the number of rounds is given in 
advance. In this setting, the value of "exploration" becomes smaller and the current best 
arm should be pulled more often as the number of remaining rounds becomes smaller. 